---
id: metrics-summarization
title: Summarization
sidebar_label: Summarization
---

<head>
  <link
    rel="canonical"
    href="https://deepeval.com/docs/metrics-summarization"
  />
</head>

import Equation from "@site/src/components/Equation";
import MetricTagsDisplayer from "@site/src/components/MetricTagsDisplayer";

<MetricTagsDisplayer referenceless={true} />

The summarization metric uses LLM-as-a-judge to determine whether your LLM (application) is generating factually correct summaries while including the necessary details from the original text. In a summarization task within `deepeval`, the original text refers to the `input` while the summary is the `actual_output`.

:::note
The `SummarizationMetric` is the only default metric in `deepeval` that is not cacheable.
:::

## Required Arguments

To use the `SummarizationMetric`, you'll have to provide the following arguments when creating an [`LLMTestCase`](/docs/evaluation-test-cases#llm-test-case):

- `input`
- `actual_output`

Read the [How Is It Calculated](#how-is-it-calculated) section below to learn how test case parameters are used for metric calculation.

## Usage

Let's take this `input` and `actual_output` as an example:

```python
# This is the original text to be summarized
input = """
The 'coverage score' is calculated as the percentage of assessment questions
for which both the summary and the original document provide a 'yes' answer. This
method ensures that the summary not only includes key information from the original
text but also accurately represents it. A higher coverage score indicates a
more comprehensive and faithful summary, signifying that the summary effectively
encapsulates the crucial points and details from the original content.
"""

# This is the summary; replace it with the actual output from your LLM application
actual_output = """
The coverage score quantifies how well a summary captures and
accurately represents key information from the original text,
with a higher score indicating greater comprehensiveness.
"""
```

You can use the `SummarizationMetric` as follows for [end-to-end](/docs/evaluation-end-to-end-llm-evals) evaluation:

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import SummarizationMetric
...

test_case = LLMTestCase(input=input, actual_output=actual_output)
metric = SummarizationMetric(
    threshold=0.5,
    model="gpt-4",
    assessment_questions=[
        "Is the coverage score based on a percentage of 'yes' answers?",
        "Does the score ensure the summary's accuracy with the source?",
        "Does a higher score mean a more comprehensive summary?"
    ]
)

# To run metric as a standalone
# metric.measure(test_case)
# print(metric.score, metric.reason)

evaluate(test_cases=[test_case], metrics=[metric])
```

There are **NINE** optional parameters when instantiating a `SummarizationMetric` class:

- [Optional] `threshold`: the passing threshold, defaulted to 0.5.
- [Optional] `assessment_questions`: a list of **closed-ended questions that can be answered with either a 'yes' or a 'no'**. These are questions you ideally want your summary to be able to answer, and they are especially helpful if you already know what a good summary for your use case looks like. If `assessment_questions` is not provided, we will generate a set of `assessment_questions` for you at evaluation time. The `assessment_questions` are used to calculate the `coverage_score`.
- [Optional] `n`: the number of assessment questions to generate when `assessment_questions` is not provided. Defaulted to 5.
- [Optional] `model`: a string specifying which of OpenAI's GPT models to use, **OR** [any custom LLM model](/docs/metrics-introduction#using-a-custom-llm) of type `DeepEvalBaseLLM`. Defaulted to 'gpt-4.1'.
- [Optional] `include_reason`: a boolean which when set to `True`, will include a reason for its evaluation score. Defaulted to `True`.
- [Optional] `strict_mode`: a boolean which when set to `True`, enforces a strict evaluation criterion. In strict mode, the metric score becomes binary: a score of 1 indicates a perfect result, and any outcome less than perfect is scored as 0. Defaulted to `False`.
- [Optional] `async_mode`: a boolean which when set to `True`, enables [concurrent execution within the `measure()` method.](/docs/metrics-introduction#measuring-metrics-in-async) Defaulted to `True`.
- [Optional] `verbose_mode`: a boolean which when set to `True`, prints the intermediate steps used to calculate said metric to the console, as outlined in the [How Is It Calculated](#how-is-it-calculated) section. Defaulted to `False`.
- [Optional] `truths_extraction_limit`: an int which when set, determines the maximum number of factual truths to extract from the `input`. The truths extracted will be used to determine the `alignment_score`, and will be ordered by importance, as decided by your evaluation `model`. Defaulted to `None`.

:::note
Sometimes, you may want to only consider the most important factual truths in the `input`. If this is the case, you can set the `truths_extraction_limit` parameter to cap the number of truths considered during evaluation, as shown in the sketch below.
:::
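
The sketch below shows an illustrative configuration that generates five assessment questions at evaluation time while only considering the ten most important truths extracted from the `input` (the values here are arbitrary and should be tuned to your use case):

```python
from deepeval.metrics import SummarizationMetric

# Illustrative values: generate 5 assessment questions at evaluation
# time, and only extract the 10 most important truths from the input
# when computing the alignment score.
metric = SummarizationMetric(
    threshold=0.5,
    n=5,
    truths_extraction_limit=10,
)
```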

### Within components

You can also run the `SummarizationMetric` within nested components for [component-level](/docs/evaluation-component-level-llm-evals) evaluation.

```python
from deepeval.dataset import Golden
from deepeval.tracing import observe, update_current_span
...

@observe(metrics=[metric])
def inner_component():
    # Set test case at runtime
    test_case = LLMTestCase(input="...", actual_output="...")
    update_current_span(test_case=test_case)
    return

@observe
def llm_app(input: str):
    # Component can be anything from an LLM call, retrieval, agent, tool use, etc.
    inner_component()
    return

evaluate(observed_callback=llm_app, goldens=[Golden(input="Hi!")])
```

### As a standalone

You can also run the `SummarizationMetric` on a single test case as a standalone, one-off execution.

```python
...

metric.measure(test_case)
print(metric.score, metric.reason)
```

:::caution
This is great for debugging or if you wish to build your own evaluation pipeline, but you will **NOT** get the benefits (testing reports, Confident AI platform) and all the optimizations (speed, caching, computation) the `evaluate()` function or `deepeval test run` offers.
:::

## How Is It Calculated?

The `SummarizationMetric` score is calculated according to the following equation:

<Equation formula="\text{Summarization} = \min(\text{Alignment Score}, \text{Coverage Score})" />

To break it down:

- The `alignment_score` determines whether the summary contains information that is hallucinated or contradicts the original text.
- The `coverage_score` determines whether the summary contains the necessary information from the original text.

While the `alignment_score` is similar to that of the [`HallucinationMetric`](/docs/metrics-hallucination), the `coverage_score` is calculated by first generating `n` closed-ended questions that can only be answered with either a 'yes' or a 'no', and then calculating the ratio of questions for which the original text and the summary yield the same answer. [Here is a great article](https://www.confident-ai.com/blog/a-step-by-step-guide-to-evaluating-an-llm-text-summarization-task) on how `deepeval`'s summarization metric was built.
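
To make this concrete, here is a minimal sketch of the coverage calculation, following the definition given in the example `input` earlier (the percentage of questions for which both texts provide a 'yes' answer). The answers below are hypothetical stand-ins for what your evaluation `model` would produce:

```python
# Hypothetical answers produced by the evaluation model for n = 5
# closed-ended assessment questions, answered once against the
# original text and once against the summary.
original_answers = ["yes", "yes", "yes", "yes", "no"]
summary_answers = ["yes", "yes", "no", "yes", "no"]

# Coverage is the fraction of questions for which both the original
# text and the summary provide a 'yes' answer.
both_yes = sum(
    o == "yes" and s == "yes"
    for o, s in zip(original_answers, summary_answers)
)
coverage_score = both_yes / len(original_answers)  # 3 / 5 = 0.6
```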

You can access the `alignment_score` and `coverage_score` from a `SummarizationMetric` as follows:

```python
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase
...

test_case = LLMTestCase(...)
metric = SummarizationMetric(...)

metric.measure(test_case)
print(metric.score)
print(metric.reason)
print(metric.score_breakdown)
```

:::note
Since the summarization score is the minimum of the `alignment_score` and `coverage_score`, a 0 value for either one of these scores will result in a final summarization score of 0.
:::
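
If you want to verify this relationship at runtime, you can compare the final score against the component scores in `score_breakdown` (the dictionary keys shown below are illustrative and may differ between `deepeval` versions, so print the dictionary to confirm them):

```python
breakdown = metric.score_breakdown
print(breakdown)  # e.g. {"Alignment": 0.9, "Coverage": 0.6}

# The final summarization score is the minimum of the two components
print(metric.score == min(breakdown.values()))  # True
```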
